Differentiation of granulomatous nodules with lobulation and spiculation signs from solid lung adenocarcinomas using a CT deep learning model

Background: The diagnosis of solitary pulmonary nodules has long been a difficult and important problem in clinical research, especially for granulomatous nodules (GNs) with lobulation and spiculation signs, which are easily misdiagnosed as malignant tumors. In this study, we therefore used a CT deep learning (DL) model to distinguish GNs with lobulation and spiculation signs from solid lung adenocarcinomas (LADCs), with the aim of improving the accuracy of preoperative diagnosis.
Methods: A total of 420 patients with pathologically confirmed GNs or LADCs from three medical institutions were retrospectively enrolled. The regions of interest in non-enhanced CT (NECT) and venous contrast-enhanced CT (VECT) were identified and labeled, and self-supervised labels were constructed. Cases from institution 1 were randomly divided into a training set (TS) and an internal validation set (IVS), and cases from institutions 2 and 3 were treated as an external validation set (EVS). Training and validation were performed using self-supervised transfer learning, and the results were compared with the radiologists' diagnoses.
Results: The DL model achieved good performance in distinguishing GNs and LADCs, with area under the curve (AUC) values of 0.917, 0.876, and 0.896 in the IVS and 0.889, 0.879, and 0.881 in the EVS for NECT, VECT, and non-enhanced with venous contrast-enhanced CT (NEVECT) images, respectively. The AUCs of radiologists 1, 2, 3, and 4 were, respectively, 0.739, 0.783, 0.883, and 0.901 in the IVS and 0.760, 0.760, 0.841, and 0.844 in the EVS.
Conclusions: A CT DL model showed great value for preoperative differentiation of GNs with lobulation and spiculation signs from solid LADCs, and its predictive performance was higher than that of radiologists.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12885-024-12611-0.


Introduction
Solitary pulmonary nodules (SPNs) are round lesions < 3 cm in diameter that are surrounded by normal lung tissue without atelectasis, hilar enlargement, or pleural effusion [1]. With the increasing use of low-dose CT screening and improved awareness of health examinations, the SPN detection rate has increased significantly [2]. However, despite the importance of distinguishing benign from malignant SPNs, some SPNs are difficult to characterize on the basis of radiologist assessments alone [3,4]. Although most solid SPNs with lobulation and spiculation signs are lung adenocarcinomas (LADCs), some have been pathologically confirmed as granulomatous nodules (GNs) postoperatively [5,6]. Up to 25% of lesions with the lobulation sign were confirmed to be benign, and although the spiculation sign indicated malignancy in 88-94% of cases, a few such lesions, for example tuberculomas, were confirmed to be benign [6]. PET/CT can evaluate whether SPNs are benign or malignant on the basis of the presence or degree of fluorodeoxyglucose uptake. However, PET/CT is expensive, is not sensitive to pulmonary nodules with a diameter of 8-10 mm, and may yield false-negative or false-positive results in SPN diagnosis [7,8]. Although MRI can also distinguish benign and malignant lesions, it is insensitive to small lesions, offers low resolution of the lung structure, and is easily disturbed by motion artifacts [9,10]. Consequently, distinguishing GNs with lobulation and spiculation signs from solid LADCs has remained difficult in clinical practice.
In traditional imaging diagnosis based on human vision, many subtle but important signs are easily overlooked, and the assessments are often subjective and empirical. Moreover, overlaps in the imaging features of different lesions can prevent even experienced radiologists from providing definitive diagnoses, and human observers cannot easily evaluate and predict deep-level information inside the tumor. These limitations highlight the need for informative, standardized, reproducible, and highly efficient methods to assist and enhance imaging diagnoses.
Deep learning (DL), an important branch of machine learning, involves learning representations from data, with an emphasis on learning successive connected layers that correspond to increasingly meaningful representations [11,12]. Compared with traditional machine learning techniques, DL can recognize lesions in images with high accuracy and automatically extract lesion features for end-to-end computation, which effectively avoids manual segmentation of lesions and complex non-automatic feature extraction processes [11-13]. In this study, a CT DL model based on self-supervised transfer learning was constructed, and its predictive performance was compared with that of radiologists to explore its predictive value for distinguishing GNs with lobulation and spiculation signs from solid LADCs.

Study population
The inclusion criteria were as follows: (1) pathologically confirmed GNs or LADCs; (2) plain and contrast-enhanced CT examinations performed within 2 weeks before surgery, with images reconstructed at a slice thickness ≤ 2 mm; (3) lesions appearing as solid SPNs without internal calcification or fat, with lobulation and spiculation signs on the margin, and with a diameter of 8-30 mm; (4) availability of complete clinical and imaging data.
The exclusion criteria were as follows: ( Based on the above inclusion and exclusion criteria, we retrospectively recruited a total of 420 patients with a pathologically confirmed diagnosis of GNs or LADCs by surgical resection or puncture biopsy between June 2013 and February 2019 from three medical institutions. There were 307 cases in Institution 1 (211 LADCs and 96 GNs) and 113 cases in Institutions 2 and 3 (70 LADCs and 43 GNs). In this study, we randomly divided the cases from Institution 1 into a training set (TS) and an internal validation set (IVS), and used the cases from Institutions 2 and 3 as an external validation set (EVS). The case screening process is depicted in Fig. 1.

CT image acquisition
Images were acquired at the three hospitals with Siemens Definition AS+ 128-slice and 64-slice CT, Canon 640-slice CT (Aquilion ONE Vision), and GE Optima CT660 64-row 128-slice CT scanners. All patients underwent plain and enhanced CT; the scan range extended from the apex of both lungs to the costophrenic angles bilaterally. Scanning parameters were as follows: tube voltage, 120 kV; automated tube current; acquisition matrix, 512 × 512; and field of view, 500 mm × 500 mm. The mediastinal and lung windows were reconstructed using standard algorithms.

Image preprocessing and lesion extraction
Image preprocessing was performed as follows: (1) each image was resampled using a linear interpolation algorithm, with the slice thickness interval set to 1 mm (although interpolation is somewhat detrimental to the detection of edge features, it ensures isotropy of the image voxels and provides higher spatial positional precision [14]); (2) the window width and window level were set to 1400 HU and −500 HU, respectively, for all images. The 3D Slicer software was used to obtain the region of interest (ROI): first, the coordinates of the centre of the lesion were used as the datum; next, coordinate points 1 mm above and 1 mm below the centre were determined; finally, a 40 mm square ROI was obtained with each of these three points as its centre. The GNs and LADCs were labeled as benign and malignant, respectively. The ROI extraction process is shown in Fig. 2.
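The windowing and ROI steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline (which used 3D Slicer); the function names `apply_window` and `extract_roi`, the nested-list volume layout, the rescaling to [0, 1], and the reading of the 40 mm ROI as a 40 × 40 pixel square at 1 mm spacing are all assumptions made for illustration.

```python
def apply_window(hu, width=1400.0, level=-500.0):
    """Clip a Hounsfield value to the display window and rescale it to [0, 1]."""
    low = level - width / 2.0   # -1200 HU for the settings quoted in the text
    high = level + width / 2.0  # +200 HU for the settings quoted in the text
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


def extract_roi(volume, center, side_px=40):
    """Return three windowed square patches around a lesion centre.

    volume : nested list indexed as volume[z][row][col], values in HU,
             assumed to be already resampled to 1 mm spacing (assumption).
    center : (z, row, col) index of the lesion centre.
    side_px: patch side in pixels (40 px corresponds to 40 mm at 1 mm spacing).
    """
    z, row, col = center
    half = side_px // 2
    n_rows, n_cols = len(volume[0]), len(volume[0][0])
    # Refuse ROIs that would fall outside the volume instead of wrapping around.
    if not (1 <= z < len(volume) - 1
            and half <= row <= n_rows - half
            and half <= col <= n_cols - half):
        raise ValueError("ROI does not fit inside the volume")
    patches = []
    for dz in (-1, 0, 1):  # adjacent slice, centre slice, adjacent slice
        plane = volume[z + dz]
        patch = [[apply_window(v) for v in r[col - half:col + half]]
                 for r in plane[row - half:row + half]]
        patches.append(patch)
    return patches
```

Returning three patches per lesion mirrors the text, in which one ROI is taken at the lesion centre and one each at 1 mm above and below it.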

Definition of labels
Our study adopted a self-supervised learning approach [15], whose main process is divided into two stages: (1) self-supervised pretext task training: a pretext task is designed, and pseudo labels for this task are generated automatically for the unlabelled data from certain attributes of the data (e.g., image translation, flipping, or rotation); after training with the pseudo labels, a network model capable of capturing the visual features of the image is obtained; (2) supervised downstream task training: after the self-supervised pretext task training has finished, the learned parameters serve as a pre-trained model and are transferred to downstream computer vision tasks by fine-tuning.
Using rotation to generate pseudo labels in self-supervised learning has been shown to achieve good results in visual, audio, text and other tasks [16]. In this study, pseudo labels were generated by rotating each ROI patch counterclockwise by 0°, 90°, 180°, and 270°, which was done to enable the CNN to learn to recognise and detect the local features of the lesions [16]. However, this approach somewhat ignores the overall features of the lesion. We designated the image rotation angle as the pseudo label and the benign or malignant status of the lesion as the original label. In addition, the geometric transformation (i.e., rotation) of the images in this step achieves data augmentation and helps avoid the problem of unbalanced samples in this study.
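The rotation-based pseudo-labelling described above can be sketched as follows. This is an illustrative example only: the helper names `rot90_ccw` and `make_pretext_samples` and the dictionary layout are hypothetical, and a patch is represented as a plain nested list rather than whatever tensor format the authors used.

```python
def rot90_ccw(patch):
    """Rotate a 2D nested list by 90 degrees counterclockwise."""
    # Transposing and then reversing the row order gives a CCW quarter turn.
    return [list(row) for row in zip(*patch)][::-1]


def make_pretext_samples(patch, original_label):
    """Build the four rotated copies of one ROI patch.

    pseudo_label k in {0, 1, 2, 3} encodes a counterclockwise rotation of
    k * 90 degrees; original_label (benign/malignant) is carried along so the
    same patches can be reused for the downstream task.
    """
    samples = []
    rotated = [list(row) for row in patch]
    for k in range(4):
        samples.append({
            "image": rotated,
            "angle": 90 * k,
            "pseudo_label": k,
            "original_label": original_label,
        })
        rotated = rot90_ccw(rotated)
    return samples


samples = make_pretext_samples([[1, 2], [3, 4]], original_label=1)
```

Each input patch yields four training samples, which is why the rotation step also acts as a fourfold data augmentation.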

Model construction
ResNet50 is a residual network composed of many residual units connected in series. It alleviates the vanishing and exploding gradient problems that occur as the network deepens, allowing the network to deepen without degrading performance. Therefore, we chose the ResNet50 network to construct the model [17]. The structure of the residual unit is shown in Supplementary Fig. 1, and the structure of ResNet50 is shown in Fig. 3. Based on the ROIs obtained above, DL models of non-enhanced CT (NECT), venous contrast-enhanced CT (VECT), and non-enhanced with venous contrast-enhanced CT (NEVECT) were established, and internal verification was performed on the IVS with five-fold cross-validation. Thus, all cases in institution 1 participated in the training and internal verification in this study. Finally, the model was validated using the EVS. The establishment process of the DL model is shown in Fig. 4.
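A five-fold split of the Institution 1 cases could be sketched as below. The paper does not describe how the folds were drawn, so the stratification by class, the fixed seed, and the function name `stratified_k_fold` are assumptions made for this example; only the class counts (211 LADCs and 96 GNs) come from the text.

```python
import random


def stratified_k_fold(labels, k=5, seed=0):
    """Yield k (train_indices, validation_indices) pairs with balanced classes.

    Stratification and the seed are illustrative assumptions, not details
    reported in the study.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled cases of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    splits = []
    for f in range(k):
        validation = sorted(folds[f])
        training = sorted(i for g in range(k) if g != f for i in folds[g])
        splits.append((training, validation))
    return splits


# Institution 1: 211 LADCs (label 1) and 96 GNs (label 0), as reported above.
labels = [1] * 211 + [0] * 96
splits = stratified_k_fold(labels, k=5)
```

Because every case appears in exactly one validation fold, all Institution 1 cases take part in both training and internal verification, as the text states.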

Radiologist lesion assessment
Two junior radiologists with 3-5 years of work experience (radiologists 1 and 2) and two senior radiologists with 5-10 years of work experience (radiologists 3 and 4) independently classified the pulmonary nodules as benign or malignant without foreknowledge of the pathology. When the diagnostic results differed, the radiologists discussed the case and reached a consensus, which was regarded as the radiologist consensus result (RCR). The four radiologists also measured the lesion diameter, defined as the longest diameter measured in the axial plane, and the average of the four radiologists' measured values was taken as the final lesion diameter.
Fig. 2 Region of interest extraction process. The 3D Slicer software was applied to obtain the region of interest (ROI): first, the coordinates of the centre of the lesion were used as the datum, then coordinate points 1 mm above and 1 mm below the centre coordinates were determined, and finally a 40 mm square ROI was obtained with each of these three points as the centre. Since the lesions we included were pulmonary nodules with a diameter of < 30 mm, the ROIs obtained for each lesion were three slices containing the entire cross-section of the lesion

Statistical analysis
SPSS 24.0 (IBM) and MedCalc (version 19.1.2.0) software were used for statistical analysis. Measurement data were expressed as mean ± standard deviation, and count data were expressed as frequencies. Independent two-sample t-tests were performed to compare measurement data, and the chi-square test was used to compare count data. Differences were considered statistically significant when p < 0.05. The kappa test was used to assess the consistency of the diagnostic results obtained by the radiologists; the larger the kappa value, the better the consistency.
The area under the curve (AUC) of the receiver operating characteristic curve, 95% confidence interval (CI), sensitivity, and specificity were obtained to evaluate the diagnostic efficacy of the DL models and radiologists. The DeLong test was used to compare the diagnostic efficacy among the different DL models and among the radiologists. Differences were considered statistically significant when p < 0.05.
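For readers unfamiliar with the kappa statistic, the following minimal sketch computes Cohen's kappa for two raters. The study used SPSS and MedCalc rather than custom code, so the function name `cohen_kappa` and the toy ratings are purely illustrative.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty case list")
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    p_o = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
              for c in categories)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)


# Toy example: 1 = malignant, 0 = benign (made-up ratings, not study data).
reader_1 = [1, 1, 1, 0, 0, 0, 1, 0]
reader_2 = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohen_kappa(reader_1, reader_2)
```

In this toy case the readers agree on 6 of 8 nodules (p_o = 0.75) while chance agreement is 0.5, giving kappa = 0.5.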
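As an illustration of the evaluation metrics, the sketch below computes the AUC through its rank interpretation (the probability that a randomly chosen malignant case scores higher than a randomly chosen benign case) together with sensitivity and specificity at a fixed cut-off. The function names, the 0.5 threshold, and the toy scores are assumptions for this example and do not reproduce the MedCalc analysis.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called malignant."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: 1 = LADC, 0 = GN (made-up scores, not study data).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
auc = auc_score(labels, scores)
sens, spec = sensitivity_specificity(labels, scores)
```

Here five of the six malignant-benign pairs are ordered correctly, so the AUC is 5/6; at the 0.5 cut-off the sensitivity is 2/3 and the specificity is 1.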

General clinical data
On the basis of the inclusion and exclusion criteria, 420 patients with GNs or LADCs were enrolled (LADCs, 281; GNs, 139; 231 males, 189 females; age range, 22-87 years; mean age, 55.93 ± 12.44 years). Samples of malignant and benign pulmonary nodules in this study are shown in Fig. 5. Cases from institution 1 (n = 307) were randomly divided into a TS and an IVS in a ratio of 7:3, while cases from institutions 2 and 3 (n = 113) were treated as the EVS. General clinical data of all patients are shown in Supplementary Table 1, and the general clinical data of the three sets are shown in Supplementary Table 2.

Backbone network selection and performance of networks without self-supervised pretext task training
It has been reported that the ResNet50 [17], DenseNet121 [18], Inception-v3 [19], ResNet18 [20] and VGG19 [21] networks are often used for classification tasks in the field of medical imaging and have achieved good performance. To select the most appropriate network, we first trained each of these networks using the parameters applied by previous researchers. The plain CT scan dataset (unrotated data) from Institution 1 was used for training and validation. The deep learning models built with the DenseNet121, Inception-v3, ResNet18, ResNet50 and VGG19 networks had AUCs of 0.79, 0.83, 0.78, 0.86 and 0.81, respectively, in the validation set, with the ResNet50 network showing the best prediction performance (AUC = 0.86). Therefore, we chose ResNet50 as the backbone network. The ROC curves of the deep learning models built with the different networks in the validation set are shown in Fig. 6.

Deep learning model and radiologist prediction performance
For the DL models based on NECT, VECT, and NEVECT images, the AUCs were 0.917, 0.876, and 0.896, respectively, in the IVS and 0.889, 0.879, and 0.881, respectively, in the EVS. In contrast, the AUCs of the assessments performed by radiologists 1, 2, 3, and 4 were 0.739, 0.783, 0.883, and 0.772, respectively, in the IVS and 0.760, 0.760, 0.841, and 0.844, respectively, in the EVS. The AUCs of the radiologist consensus results were 0.772 in the IVS and 0.785 in the EVS. In addition, the comparison revealed that the network pre-trained with rotation performed better than the network trained without rotation (AUC = 0.86). The AUC, 95% CI, sensitivity, and specificity of the DL models and radiologists are shown in Table 1. The receiver operating characteristic curves of the DL models are shown in Supplementary Fig. 2, and those of the radiologists are shown in Supplementary Fig. 3.

Radiologist diagnostic results and consistency test
Junior radiologists generally showed higher diagnostic accuracy for identifying solid LADCs than GNs, with a significant difference in the IVS (p < 0.001). However, the senior radiologists showed no definite pattern in the diagnostic accuracy for the two types of lesions. The radiologists' diagnostic results are shown in Supplementary Table 3. The diagnostic results of the radiologists were tested for consistency. Diagnostic consistency was high among senior radiologists (highest kappa value, 0.801), medium among junior radiologists (0.4 < kappa < 0.75), and low between senior and junior radiologists (lowest kappa value, 0.282). The results of the consistency test for the radiologists' diagnoses are shown in Supplementary Table 4.
Fig. 5 Examples of malignant and benign samples from this study. A: The margins of the lesion show lobulation and spiculation signs, and the lesion was confirmed to be lung adenocarcinoma after surgery. B: The lesion also has lobulation and spiculation signs, but the postoperative pathology was granuloma; this is not uncommon in clinical practice and is easily misdiagnosed as a malignant tumour by radiologists
Fig. 6 The ROC curves of the deep learning models built by different networks in the validation set

Comparison of prediction performance
The DeLong test showed a significant difference in prediction performance between the non-enhanced and venous contrast-enhanced CT DL models in the IVS (p = 0.001), with non-enhanced CT showing better prediction performance. No other significant differences were observed in the prediction performance of the DL models (all p > 0.05). The DeLong test also showed that the prediction performance of radiologists with the same experience level was not significantly different (all p > 0.05); the prediction performance of senior radiologists was higher than that of junior radiologists, and the difference was statistically significant in the IVS. The DeLong test results for the predictive performance of the DL models and radiologists are presented in Supplementary Tables 5 and 6, respectively.

Discussion
Lung adenocarcinomas require early resection while granulomatous nodules do not, which underlines the importance of preoperative identification of these lesions. The biological information of tumors can be characterized by specific CT signs [22], with lobulation [23] and spiculation [24] both often associated with malignant tumors. However, some solitary pulmonary nodules showing these signs have been pathologically proven to be granulomatous nodules [5,6]. Therefore, we used self-supervised transfer learning and the ResNet50 network to establish a deep learning model for distinguishing granulomatous nodules from solid lung adenocarcinomas. The model showed maximum area under the curve values of 0.917 and 0.889 in the internal and external validation sets, respectively, highlighting the usefulness of this model in distinguishing these lesions and thereby facilitating preoperative diagnosis.
Self-supervised learning is an unsupervised learning method in which the model exploits information contained in the data themselves to maximize what it learns [25]. Within a reasonable range, the accuracy of DL increases with the number of network layers, but when the depth exceeds a certain threshold, exploding and vanishing gradient problems can reduce accuracy on the training set [22]. ResNet50 can improve the performance of the network while increasing the depth [26,27]. Therefore, this study used self-supervised transfer learning and the ResNet50 network to establish a DL model.
Radiomics is sensitive to CT acquisition settings, lesion segmentation, feature extraction and modelling methods. Unlike radiomics, DL extracts features through end-to-end deep convolutional neural networks, and as data-driven algorithms, deep learning-based models can achieve higher performance when large datasets are constructed [28,29]. A radiomics model for distinguishing GNs with lobulation and spiculation signs from solid LADCs has been previously reported [30], in which the AUCs of the non-enhanced, venous contrast-enhanced, and non-enhanced with venous contrast-enhanced CT models in the validation set were 0.817, 0.837, and 0.841, respectively. The DL model in this study showed better performance for predicting GNs and LADCs, indicating that DL models offer advantages over radiomics models for preoperatively distinguishing GNs with lobulation and spiculation signs from solid LADCs.
Many recent studies have also used DL to predict benign and malignant pulmonary nodules. Yang et al. [31] established a DL model to predict benign and malignant lung nodules (AUC = 0.84), while Feng et al. [32] performed a retrospective analysis of tuberculous GNs and LADCs and obtained AUCs of 0.889, 0.879, and 0.809 in the test set, IVS and EVS, respectively. This study focused on distinguishing GNs and solid LADCs, and the maximum AUCs in the IVS and EVS were 0.917 and 0.889, respectively, which are higher than the results obtained by Feng et al. [32]. Thus, a DL model based on self-supervised transfer learning and ResNet50 shows great potential for distinguishing GNs with lobulation and spiculation signs from solid LADCs and thereby facilitating preoperative diagnosis.
Our study also showed that diagnostic consistency was high among senior radiologists, medium among junior radiologists, and low between senior and junior radiologists, highlighting inconsistencies in radiologist assessments. The lack of effective and unified diagnostic standards for distinguishing between these lesions, together with the presence of lobulation and spiculation signs, can impair junior radiologists' judgment. Moreover, traditional image diagnosis is based on evaluating the morphological characteristics of the lesion and on prior knowledge, resulting in differences among the diagnoses made by different radiologists. The prediction performance of senior radiologists was higher than that of junior radiologists, further highlighting the influence of diagnostic experience on radiologists' ability to distinguish between GNs with lobulation and spiculation signs and solid LADCs.
The AUCs of the DL model in this study were higher than those of the radiologists, suggesting that the DL model has clear advantages over radiologist assessments in differentiating GNs and LADCs. In addition, the specificity of the DL model was generally higher than that of the radiologists, whereas its diagnostic sensitivity was generally lower. This may be related to the fact that the lung nodules included in this study had lobulation and spiculation signs. These signs are usually indicative of malignancy, and radiologists, influenced by subjective a priori knowledge, are more likely to diagnose lung adenocarcinoma when both signs are present in a lung nodule. Therefore, radiologists have higher diagnostic sensitivity but lower specificity. Deep learning models, on the other hand, are completely data-driven and unaffected by subjective a priori knowledge, and thus have an advantage in diagnostic specificity.
In this study, the performance of the DL models with unenhanced CT was higher than that with enhanced CT, indicating that unenhanced CT may offer advantages over enhanced CT when using DL models. This may have occurred because contrast agent residues in the tissue gaps of the lesion on enhanced scans interfere with the model's evaluation of the internal structure of the lesion, degrading its prediction performance. Therefore, in clinical practice, DL models with unenhanced CT are more applicable and cost-effective in the screening and follow-up of pulmonary nodules.

Limitations
Our study had some limitations. First, this was a retrospective study, which may have led to selection bias. Second, this study only compared GNs with LADCs, and did not include other inflammatory nodules, squamous cell carcinomas, or other tumors. Third, due to incomplete clinical data for parameters such as smoking history and tumor markers, this study did not analyze the influence of clinical risk factors on the predictive performance of the DL model.

Conclusion
In conclusion, a CT deep learning model can effectively distinguish granulomatous nodules with lobulation and spiculation signs from solid lung adenocarcinomas, and its diagnostic performance and specificity are superior to those of radiologists. Therefore, in clinical practice, when radiologists encounter SPNs with lobulation and spiculation signs that are difficult to characterise, CT deep learning, as a noninvasive and highly repeatable approach, can assist radiologists with preoperative differential diagnosis and provide a theoretical basis for the development of appropriate clinical diagnosis and treatment plans. In addition, our study found that the non-enhanced model performs better than the enhanced one. Therefore, this method can also be used in patients with contraindications to contrast enhancement, such as contrast allergies.
analysis: YHW, WSW; Supervision and mentorship: YHW, WSW, YBG; Manuscript drafting or revision: YHW, WSW. Each author contributed important intellectual content during manuscript drafting or revision and accepts accountability for the overall work by ensuring that questions pertaining to the accuracy or integrity of any portion of the work are appropriately investigated and resolved.

Fig. 1 The case screening process of this study

Fig. 3 The structure of ResNet50 and parameters during training. ResNet50 includes 49 convolution layers and one fully connected layer from input to output, which can be divided into five stages. The structure of the first stage is relatively simple and can be regarded as the preprocessing of the input. The last four stages are composed of bottlenecks, and their structures are relatively similar. CONV represents the convolution layer, which is used to extract features; Maxpool indicates the maximum pooling operation, which downsamples the feature maps and can help reduce overfitting; Relu refers to the activation function, which accepts the signal output from the previous unit and converts it into a form that can be received by the next unit; BN refers to batch normalization, which normalizes the activations of a layer over each mini-batch to stabilize training; BTNK in stages 1-4 represents the bottleneck structure, and each BTNK contains three convolution layers; FC layer represents the fully connected layer, which combines features and performs classification